Improvement of three simultaneous speech recognition by using AV integration and scattering theory for humanoid
نویسندگان
چکیده
This paper presents improvement of recognition of three simultaneous speeches for a humanoid robot with a pair of microphones. In such situations, sound separation and automatic speech recognition (ASR) of the separated speech are difficult, because the number of simultaneous talkers exceeds that of its microphones, the signal-to-noise ratio is quite low (around -3 dB) and noise is not stable due to interfering voices. To improve recognition of three simultaneous speeches, two key ideas are introduced — acoustical modeling of robot head by scattering theory and two-layered audio-visual integration in both name and location, that is, speech and face recognition, and speech and face localization. Sound sources are separated in real-time by an active direction-pass filter (ADPF), which extracts sounds from a specified direction by using interaural phase/intensity difference estimated by scattering theory. Since features of sounds separated by ADPF vary according to the sound direction, multiple Directionand Speaker-dependent (DS-dependent) acoustic models are used. The system integrates ASR results by using the sound direction and speaker information by face recognition as well as confidence measure of ASR results to select the best one. The resulting system shows around 10% improvement on average against recognition of three simultaneous speeches, where three talkers were located 1 meter from the humanoid and apart from each other by 0 to 90 degrees at 10-degree intervals.
منابع مشابه
Three simultaneous speech recognition by integration of active audition and face recognition for humanoid
This paper addresses listening to three simultaneous talkers by a humanoid with two microphones. In such situations, sound separation and automatic speech recognition (ASR) of the separated speech are difficult, because the number of simultaneous talkers exceeds that of its microphones, the signal-to-noise ratio is quite low (around -3 dB) and noise is not stable due to interfering voices. Huma...
متن کاملSimultaneous Speech Recognition Based on Automatic Missing Feature Mask Generation by Integrating Sound Source Separation
Our goal is to realize a humanoid robot that has the capabilities of recognizing simultaneous speech. A humanoid robot under real-world environments usually hears a mixture of sounds, and thus three capabilities are essential for robot audition; sound source localization, separation, and recognition of separated sounds. In particular, an interface between sound source separation and speech reco...
متن کاملAllophone-based acoustic modeling for Persian phoneme recognition
Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...
متن کاملTitle of Dissertation : CORTICAL DYNAMICS OF AUDITORY - VISUAL SPEECH : A FORWARD MODEL OF MULTISENSORY INTEGRATION
Title of Dissertation: CORTICAL DYNAMICS OF AUDITORYVISUAL SPEECH: A FORWARD MODEL OF MULTISENSORY INTEGRATION. Virginie van Wassenhove, Ph.D., 2004 Dissertation Directed By: David Poeppel, Ph.D., Department of Linguistics, Department of Biology, Neuroscience and Cognitive Science Program In noisy settings, seeing the interlocutor’s face helps to disambiguate what is being said. For this to hap...
متن کاملروشی جدید در بازشناسی مقاوم گفتار مبتنی بر دادگان مفقود با استفاده از شبکه عصبی دوسویه
Performance of speech recognition systems is greatly reduced when speech corrupted by noise. One common method for robust speech recognition systems is missing feature methods. In this way, the components in time - frequency representation of signal (Spectrogram) that present low signal to noise ratio (SNR), are tagged as missing and deleted then replaced by remained components and statistical ...
متن کامل